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Abstract 

Max-convolution is an important problem closely resembling stan¬ 
dard convolution; as such, max-convolution occurs frequently across many 
fields. Here we extend the method with fastest known worst-case run¬ 
time, which can be applied to nonnegative vectors by numerically ap¬ 
proximating the Chebyshev norm || ■ IK, and use this approach to de¬ 
rive two numerically stable methods based on the idea of computing p- 
norms via fast convolution: The first method proposed, with runtime in 
0(fclog(fc) log(log(fc))) (which is less than 18fclog(fc) for any vectors that 
can be practically realized), uses the p-norm as a direct approximation 
of the Chebyshev norm. The second approach proposed, with runtime 
in 0(fclog(fc)) (although in practice both perform similarly), uses a novel 
null space projection method, which extracts information from a sequence 
of p-norms to estimate the maximum value in the vector (this is equivalent 
to querying a small number of moments from a distribution of bounded 
support in order to estimate the maximum). The p-norm approaches are 
compared to one another and are shown to compute an approximation of 
the Viterbi path in a hidden Markov model where the transition matrix 
is a Toeplitz matrix; the runtime of approximating the Viterbi path is 
thus reduced from 0(nk 2 ) steps to 0(nk log(fc)) steps in practice, and is 
demonstrated by inferring the U.S. unemployment rate from the S&P 500 
stock index. 


1 



1 Introduction 


Max-convolution occurs frequently in signal processing and Bayesian inference: 


it is used in image analysis (Ritter and Wilson 2000), in network calculus (Boyer 


et al. 2013), in economic equilibrium analysis (Sun and Yang 2002), and in a 


probabilistic variant of combinatoric generating functions, wherein information 
on a sum of values into their most probable constituent parts (e.g . identifying 
proteins from mass spectrometry (Scr ang et al.| [2010 Serang 2014)). Max- 
convolution operates on the semi-ring (max, x), meaning that it behaves iden¬ 
tically to a standard convolution, except it employs a max operation in lieu of 
the + operation in standard convolution (max-convolution is also equivalent to 
min-convolution, also called infimal convolution, which operates on the tropical 
semi-ring (min, +)). Due to the importance and ubiquity of max-convolution, 
substantial effort has been invested into highly optimized implementations (e.g., 


implementations of the quadratic method on GPUs; Zach et al., 2008). 


Max-convolution can be defined using vectors (or discrete random variables, 
whose probability mass functions are analogous to nonnegative vectors) with the 
relationship M = L + R. Given the target sum M = m, the max-convolution 
finds the largest values L[(] and R[r) for which m = £ + r. 


M[m] = max L[^]i?[r] 

£,r: m=l-\-r 

= ma xL[l]R[m — £] 

= (L * max R ) [m] 


where * max denotes the max-convolution operator. In probabilistic terms, this is 
equivalent to finding the highest probability of the joint events Pr(L = l, R = r) 
that would produce each possible value of the sum M = L + R (note that in the 
probabilistic version, the vector M would subsequently need to be normalized 
so that its sum is 1). 


Although applications of max-convolution are numerous, only a small num¬ 
ber of methods exist for solving it (Serang 2015). These methods fall into 


two main categories, each with their own drawbacks: The first category 
consists of very accurate methods that are have worst-case runtimes either 


quadratic ( 

Bussieck et al. 

1994 

) or 

the worst-case ( 

Bremner et al. 2006 


1994) or slightly more efficient than quadratic in 


computes a numerical approximation to the desired result, but in 0(k log 2 (fc)) 
steps; however, no bound for the numerical accuracy of this method has been 
derived (Serang 2015). 

While the two approaches from the first category of methods for solv¬ 
ing max-convolution do so by either using complicated sorting routines or 
by creating a bijection to an optimization problem, the numerical approach 
solves max-convolution by showing an equivalence between * max and the pro¬ 
cess of first generating a vector u( m ' 1 for each index m of the result (where 
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_ L[£\R[m — £\ for all in-bounds indices) and subsequently computing 
the maximum M[m ] = max^ u [<?]. When L and R are nonnegative, the maxi¬ 
mization over the vector can be computed exactly via the Chebyshev norm 

M[m] = maxa^[t] 

= lim ||u (m) || p 


but requires 0{k 2 ) steps (where k is the length of vectors L and R). However, 
once a fixed p*-norm is chosen, the approximation corresponding to that p* can 
be computed by expanding the p*-norm to yield 


lim \\u {m) 

p—>oo 


1 

(£(<•'”>[<)" 

(v ’■[’« - (f')" 

fe ( LP ') [f| ( R,, ') [m “ <1 

l 

^ L p * R p ^ P [m\ 


1 

P* 


where L p = ( (E[0]) p ,(L[1]) P , ..., (L[k — l]) p ) and * denotes standard 
convolution. The standard convolution can be done via fast Fourier transform 
(FFT) in 0(k\og 2 {k)) steps, which is substantially more efficient than the 0{k 2 ) 
required by the naive method (Algorithm [l]). 


To date, the numerical method has currently demonstrated the best speed- 
accuracy trade-off on Bayesian inference tasks, and can be generalized to mul¬ 
tiple dimensions ( i.e ., tensors). In particular, they have been used with proba- 


bilistic convolution trees (Serang 2014 

) to efficiently compute the most probable 

values of discrete random variables A 0 , A l5 ... X n _\ for which the sum is known 

Xq + X\ + ... X n _i — y ( 

Serang 

2014 

). The one-dimensional variant of this 


problem (i.e., where each X-, is a one-dimensional vector) solves the probabilis¬ 
tic generalization of the subset sum problem, while the two-dimensional variant 
(i.e., where each X is a one-dimensional matrix) solves the generalization of the 
knapsack problem (note that these problems are not NP-hard in this specific 
case, because we assume an evenly-spaced discretization of the possible values 
of the random variables). 

However, despite the practical performance that has been demonstrated by 
the numerical method, only cursory analysis has been performed to formalize the 
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influence of the value of p* on the accuracy of the result and to bound the error 
of the p*-norm approximation. Optimizing the choice of p* is non-trivial: Larger 
values of p* more closely resemble a true maximization under the p*-norm, but 
result in underflow (note that in Algorithm [lj the maximum values of both 
L and R can be divided out and then multiplied back in after max-convolution 
so that overflow is not an issue). Conversely, smaller values of p* suffer less 
underflow, but compute a norm with less resemblance to maximization. Here 
we perform an in-depth analysis of the influence of p* on the accuracy of numer¬ 
ical max-convolution, and from that analysis we construct a modified piecewise 
algorithm, on which we demonstrate bounds on the worst-case absolute error. 
This modified algorithm, which runs in 0{k log(fc) log(log(fc))) steps, is demon¬ 
strated using a hidden Markov model describing the relationship between U.S. 
unemployment and the S&P 500 stock index. 

We then extend the modified algorithm and introduce a second modified 
algorithm, which not only uses a single p-norrn as a means of approximating 
the Cliebyshev norm, but instead uses a sequence of p-norms and assembles 
them using a projection as a means to approximate the Chebysliev norm. Using 
numerical simulations as evidence, we make a conjecture regarding the relative 
error of the null space projection method. In practice, this null space projection 
algorithm is shown to have similar runtime and higher accuracy when compared 
with the piecewise algorithm. 


2 Methods 

We begin by outlining and comparing three numerical methods for max- 
convolution. By analyzing the benefits and deficits of each of these methods, 
we create improved variants. All of these methods will make use of the basic 
numerical max-convolution idea summarized in the introduction, and as such we 
first declare a method for computing the numerical max-convolution estimate 
for a given p* as numericalMaxConvolveGivenPStar (Algorithm [I]). 


Algorithm 1 Numerical max-convolution given a fixed p*, a numerical 
method to estimate the max-convolution of two PMFs or nonnegative vectors. 
The parameters are two nonnegative vectors L' and R' (both scaled so that they 
have maximal element 1) and the numerical value p* used for computation. The 
return value is a numerical estimate of the max-convolution L' * max R' ■ 

1: procedure numericalMaxConvolveGivenPStar(Z/, R' , p*) 

2: Ml, vL[£] <- L[£\ p * 

3: Mr, vR[r] <— R[r] p 

4: vM <— vL * vR- i > Standard FFT convolution is used here 

5: Mm, M' [m] <— vM[m] p* 

6: return M' 

7: end procedure 
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2.1 Fixed Low-Value p* = 8 Method: 

The effects of underflow will be minimal (as it is not very far from standard 
FFT convolution, an operation with high numerical stability), but it can still 
be imprecise due to numerical “bleed-in” (z.e. error due to contributions from 
non-maximal terms for a given because the p*-norm is not identical to the 
Chebyshev norm). Overall, this will perform well on indices where the exact 
value of the result is small, but perform poorly when the exact value of the 
result is large. 

2.2 Fixed High-Value p* = 64 Method: 

As noted above, will offer the converse pros and cons compared to using a low 
p *: numerical artifacts due to bleed-in will be smaller (thus achieving greater 
performance on indices where the exact values of the result are larger), but 
underflow may be significant (and therefore, indices where the exact results of 
the max-convolution are small will be inaccurate). 

2.3 Higher-Order Piecewise Method: 

The higher-order piecewise method formalizes the empirical cutoff values found 
in Serang 2015; previously, numerical stability boundaries were found for each 
p* by computing both the exact max-convolution (via the naive 0(k 2 ) method) 
and via the numerical method using the ascribed value of p*, and finding the 
value below which the numerical values experienced a high increase in relative 
absolute error. 

Those previously observed empirical numerical stability boundaries can be 
formalized by using the fact that the employed mimpy implementation of FFT 
convolution has high accuracy on indices where the result has a value > r 
relative to the maximum value; therefore, if the arguments L and R are both 
normalized so that each has a maximum value of 1, the fast max-convolution 
approximation is numerically stable for any index m where the result of the 
FFT convolution, i.e. vM[m], is > r. The numpy documentation defines a 
conservative numeric tolerance for underflow r = 10“ 12 , which is a conservative 
estimate of the numerical stability boundary demonstrated in Figure [T1 (those 
boundary points occur very close to the true machine precision e ~ 10^ 5 ). 

Because Cooley-Tukey implementations of FFT-based convolution (e.g., the 
numpy implementation) are widely applied to large problems with extremely 
small error, we will make a simplification and assume that, when constraining 
the FFT result to reach a value higher than machine epsilon (+ tolerance thresh¬ 
old), the error from the FFT is negligible in comparison to the error introduced 
by the p*-norm approximation. This is firstly because the only source of nu¬ 
merical error during FFT (assuming an FFT implementation with numerically 
precise twiddle factors) on vectors in [0, l] fc will be the result of underflow from 
repeated addition and subtraction (neglecting the non-influencing multiplication 
with twiddle factors, which each have magnitude 1). The numerically imprecise 
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k = 128 


k = 256 




Numerical estimate to power p* 



Numerical estimate to power p* 


Figure 1: Empirical estimate of r to construct a piecewise method. 

For each k £ {128,256,512,1024}, 32 replicate max-convolutions (on vectors 
filled with uniform values) are performed. Error from two sources can be seen: 
error due to underflow is depicted in the sharp left mode, whereas error due to 
imperfect approximation, where || • || p * > || ■ Hoc can be seen in the gradual mode 
on the right. Error due to p*-norm approximation is significantly smaller when 
p* is larger (thereby flattening the right mode), but larger p* values are more 
susceptible to underflow, pushing more indices into the left mode. Regardless 
of the value of k, error due to underflow occurs when (|| • || p ») p goes below 
ss 10 —15 ; this is approximately the numerical tolerance for r described by the 
numpy documentation. Therefore, at each index m we can construct a piecewise 
method that uses the largest value of p* for which the FFT convolution result is 
not close to the machine precision (ie., (||'M* m )|| p .) p > r for some r > ICE 15 ). 
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routines are thus limited to (x + y) — x; when x >> y (i.e., | < e « 10~ la , the 
machine precision), then (x + y) — x will return 0 instead of y. To recover at 
least one bit of the significand, the intermediate results of the FFT must sur¬ 
pass machine precision e (since the worst case addition initially happens with 
the maximum x = 1.0). 

The maximum sum of any values from a list of k such elements can never 
exceed fc; for this reason, a conservative estimate of the numerical tolerance of 
an FFT (with regard to underflow) will be the smallest value of y for which 
\ > e; thus, y > ek. This yields a conservative estimate of the minimum value 
in one index at the result of an FFT convolution: when the result at some index 
m is > ek, then the result should be numerically stable. For this reason, we 
use a numerical tolerance r = 10~ 12 , thereby ensuring that the vast majority 
of numerical error for the numerical max-convolution algorithm is due to the 
p*-norm approximation (i.e., employing ||u( m )|| p » instead of Hu^Hoo) and not 
due to the long-used and numerically performant FFT result. Furthermore, 
in practice the mean squared error due to FFT will be much smaller than 
the conservative worst-case outlined here, because it is difficult for the largest 
intermediate summed value (in this case x) to be consistently large when many 
such very small values (in this case y) are encountered in the same list. Although 
t could be chosen specifically for a problem of size k, note that this simple 
derivation is very conservative and thus it would be better to use a tighter 
bound for choosing r. Regardless, for an FFT implementation that isn’t as 
performant ( e.g ., because it uses float types instead of double), increasing r 
slightly would suffice. 

Therefore, from this point forward we consider that the dominant cause of 
error to come from the max-convolution approximation. Using larger p* values 
will provide a closer approximation; however, using a larger value of p* may 
also drive values to zero (because the inputs L and R will be normalized within 
Algorithm [l] so that the maximum of each is 1 when convolved via FFT), 
limiting the applicability of large p* to indices m for which vM[m\ > r. 

Through this lens, the choice of p* can be characterized by two opposing 
sources of error: higher p* values better approximate ||u^ m ^|| p . but will be nu¬ 
merically unstable for many indices; lower p* values provide worse approxima¬ 
tions of ||u( m )|| p . but will be numerically unstable for only few indices. These 
opposing sources of error pose a natural method for improving the accuracy of 
this max-convolution approximation. By considering a small collection of p* 
values, we can compute the full numerical estimate (at all indices) with each p* 
using Algorithm [l] computing the full result at a given p* is £ 0(k\og 2 (k)), so 
doing so on some small number c of p* values considered, then the overall run¬ 
time will be £ 0(ck\og 2 (k)). Then, a final estimate is computed at each index 
by using the largest p* that is stable (with respect to underflow) at that index. 
Choosing the largest p* (of those that are stable with respect to underflow) cor¬ 
responds to minimizing the bleed-in error, because the larger p* becomes, the 
more muted the non-maximal terms in the norm become (and thus the closer 
the p*-norm becomes to the true maximum). 

Here we introduce this piecewise method and compare it to the simpler low- 
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value p* = 8 and high-value p* = 64 methods and analyze the worst-case error 
of the piecewise method. 


Algorithm 2 Piecewise numerical max-convolution , a numerical method 
to estimate the max-convolution of nonnegative vectors (revised to reduce bleed- 
in error). This procedure uses a p* close to the largest possible stable value 
at each result index. The return value is a numerical estimate of the max- 
convolution L * max R. The runtime is in 0(fclog 2 (fc) log 2 (p£ iax )). 

1: procedure numericalMaxConvolvePiecewise(L, R, p^) 

2: t max <- argmax^ L[£] 

3: r max <- argmax r R[r] 


5: R' <— „[ 11 — T l> Scale to a proportional problem on L' R' 

Tt[r max J 

6: allPStar <- [2°, 2 1 ,.2f log2(p »« ) l ] 

7: for i € {0, 1, ... len(allPStar)} do 

8: resForAUPStar[i] 4— fftNonnegMaxConvolveGivenPStar(Z/, R’, 

allP Star[i]) 

9: end for 

10: for m € {0,1,.. . len(L) + len(R) — 1} do 

11: maxStablePStarIndex[m\ <— max{i : {r es For AllP Star [i\ [r/r] ) allPStar ^ > 

t)} 

12: end for 

13: for m € {0, 1,.. . len(L) + len(R) — 1} do 

14: i 4— maxStablePStarIndex[m\ 

15: result[m\ r es For All P Star [i][m\ 

16: end for 

17: return L[£ ma x] x i?[r max ] x result l> Undo previous scaling 

18: end procedure 


3 Results 

This section derives theoretical error bounds as well as a practical comparison on 
an example for the standard piecewise method. Furthermore the development 
of an improvement with affine scaling is shown. Eventually, an evaluation of the 
latter is performed on a larger problem. Therefore we applied our technique to 
compute the Viterbi path for a hidden Markov model (HMM) to assess runtime 
and the level of error propagation. 

3.1 Error and Runtime Analysis of the Piecewise Method 

We first analyze the error for a particular underflow-stable p* and then use that 
to generalize to the piecewise method, which seeks to use the highest underflow- 
stable p*. 








3.1.1 Error Analysis for a Fixed Underflow-Stable p*: 

We first scale L and R into L' and R' respectively, where the maximum elements 
of both L' and R' are 1; the absolute error can be found by unsealing the absolute 
error of the scaled problem: 

| exact(L, R)[m\ — numeric{L', R')[m\\ 

= maxLf^l max_R[r] \exact(L', R^lm] — numeric(L', R')[m}\. 

i r 

We first derive an error bound for the scaled problem on L' , R! (any mention 
of a vector u refers to the scaled problem), and then reverse the scaling to 
demonstrate the error bound on the original problem on L , R. 

For any particular “underflow-stable” p* (i.e., any value of p* for which 

(II lip*) P > T )> the absolute error for the numerical method for fast max- 
convolution can be bound fairly easily by factoring out the maximum element 
of u'- m > (this maximum element is equivalent to the Chebyshev norm) from the 
p*-norm: 

| exact(L', R')[m\ — numeric{L' .R'^m] \ 



where v^ m ' ) is a nonnegative vector of the same length as u^ m ' ) (this length 
is denoted k m ) where v^ m ' > contains one element equal to 1 (because the maxi¬ 
mum element of u^ must, by definition, be contained within u( m ' > ) and where 
no element of v'- m > is greater than 1 (also provided by the definition of the 
maximum). 


ll^ (m) ll P * < l|(M,...l)|| P * 

1 
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Thus, since ||i/ m )||p* > 1, the error is bound: 

\exact(L', R')[m] — numeric(L ', R')[m] 



because Vm, ||u^ m ^||oo < 1 for a scaled problem on li,R f . 


3.1.2 Error Analysis of Piecewise Method 

However, the bounds derived above are only applicable forp* where > 

t. The piecewise method is slightly more complicated, and can be partitioned 
into two cases: In the first case, the top contour is used ( i.e ., when p^ ax is 
underflow-stable). Conversely, in the second case, a middle contour is used 
(i.e., when p^ ax is not underflow-stable). In this context, in general a contour 
comprises of a set of indices m with the same maximum stable p *. 

In the first case, when we use the top contour p* = p(jj ax , we know that p^ax 
must be underflow-stable, and thus we can reuse the bound given an underflow- 
stable p*. 

In the second case, because the p* used is < p) nax , it follows that the next 
higher contour (using 2p*) must not be underflow-stable (because the highest 
underflow-stable p* is used and because the p* are searched in log-space). The 
bound derived above that demonstrated 

||« (m) ||p. < Halloo*# 


can be combined with the property that || • || p * > || • for any p* > 1 to show 


that 


lM m) ||oo G 




Thus the absolute error can be bound again using the fact that we are in a 
middle contour: 


= 

||u (m) 

Up* 

Halloo 

= 

||u (m) 

IIp- 

(, ll« (m) ||oo\ 

l iM m) iip*; 


||u (m) 



< 

IIp* 

i - 


< 
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The absolute error from middle contours will be quite small when p* = 1 

1 

is the maximum underflow-stable value of p* at index m, because t 2 ^, the 

-1 

first factor in the error bound, will become \/t « 10~ 6 , and 1 — km < 1 
(qualitatively, this indicates that a small p* is only used when the result is very 
close to zero, leaving little room for absolute error). Likewise, when a very large 

=$ alT V'.,. 

p* is used, then 1 — km becomes very small, while r 2 p* <1 (qualitatively, this 
indicates that when a large p* is used, the || • || p * ~ || • Hoc, and thus there is 
little absolute error). Thus for the extreme values of p*, middle contours will 
produce fairly small absolute errors. The unique mode p* mo d e can be found by 
finding the value that solves 


d 


dp 


* 

mode 



= o, 


which yields 

* _ log 2 (k m ) 

Pmode - 2 log 2 (fc) —log 2 (t) \ * 

^°§2 V l og2 (r) > 

An appropriate choice of p^ ax should be > Pmode so that the error for any 
contour (both middle contours and the top contour) is smaller than the error 
achieved at Pmode > allowing us to use a single bound for both. Choosing Pm ax = 
Pmode would guarantee that all contours are no worse than the middle-contour 
error at P* node i however, using Pm ax = Pmode I s shill quite liberal, because it would 
mean that for indices in the highest contour (there must be a nonempty set of 
such indices, because the scaling on L' and R' guarantees that the maximum 
index will have an exact value of 1, meaning that the approximation endures no 
underflow and is underflow-stable for every p*), a better error could be achieved 
by increasing p* lax . For this reason, we choose Pm ax so that the top-contour 
error produced at p* lax is not substantially larger than all errors produced for 
p* before the mode (i.e., for p* < p* mo de)- 

Choosing any value of p* lax > Pm ode guarantees the worst-case absolute error 
bound derived here; however, increasing p* lax further over Pm ode may possibly 
improve the mean squared error in practice (because it is possible that many in¬ 
dices in the result would be numerically stable with p* values substantially larger 
than Pmode)- However, increasing p^ ax >> Pm ode will produce diminishing re¬ 
turns and generally benefit only a very small number of indices in the result, 
which have exact values very close to 1. In order to balance these two aims (in¬ 
creasing Pm ax enough over Pmode but n °t excessively so), we make a qualitative 
assumption that a non-trivial number of indices require us to use a p* below 
Pmodei therefore, increasing Pm ax to produce an error significantly smaller than 
the lowest worst-case error for contours below the mode (i.e. p* < Pmode ) w hl 
increase the runtime without significantly decreasing the mean squared error 
(which will become dominated by the errors from indices that use p* < Pmode)- 
The lowest worst-case error contour below the mode is p* = 1 (because the 
absolute error function is unimodal, and thus must be increasing until Pmode 
and decreasing afterward); therefore, we heuristically specify that Pm ax should 
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produce a worst-case error on a similar order of magnitude to the worst-case er¬ 
ror produced with p* = 1. In practice, specifying the errors at ax and p* = 1 
should be equal is very conservative (it produces very large estimates of p* nax , 
which sometimes benefit only one or two indices in the result); for this reason, 
we heuristically choose that the worst-case error at p^ ax should be no worse 
than square root of the worst case error at p* = 1 (this makes the choice of p^ ax 
less conservative because the errors at p* = 1 are very close to zero, and thus 
their square root is larger). The square root was chosen because it produced, for 
the applications described in this paper, the smallest value of p^ ax for which the 
mean squared error was significantly lower than using p^ ax = P* mo de (th- e lowest 
value of Pm ax guaranteed to produce the absolute error bound). This heuristic 
does satisfy the worst-case bound outlined here (because, again, p^ ax > p))j 0( j e ), 
but it could be substantially improved if an expected distribution of magnitudes 
in the result vector were known ahead of time: prior knowledge regarding the 
number of points stable at each p* considered would enable a well-motivated 
choice of p^ ax that truly optimizes the expected mean squared error. 

From this heuristic choice of p^ ax , solving 

1 

= h P max — 1 

(with the square root of the worst-case at p* = 1 on the left and the worst-case 
error at pj^ax 011 the right) yields 

log 2 (fc) 

/'max /- 

io g2 (i + \]Vt (i - £)) 

_ log 2 (fc) 
log 2 (l + 

for any non-trivial problem (i.e., when k >> 1), and thus 

Pmax ~ l°g 1 + r i( fc )> 

indicating that the absolute error at the top contour will be roughly equal to 
the fourth root of r. 


3.1.3 Worst-case Absolute Error 

By setting p^ ax in this manner, we guarantee that the absolute error at any 
index of any unsealed problem on L, R is less than 

max LIT] max R[r] r 

l r 

where is defined above. The full formula for the middle-contour error 

at this value of p^ode c l° es not simplify and is therefore quite large; for this 
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reason, it is not reported here, but this gives a numeric bound of the worst case 
middle-contour error that is bound in terms of the variable k (and with no other 
free variables). 


3.1.4 Runtime Analysis 

The piecewise method clearly performs log 2 (Pm ax ) FFTs (each requiring 
0(k\og 2 {k)) steps); therefore, since is chosen to be log (, k ) (to achieve 

the desired error bound), the total runtime is thus 


0(fclog 2 (ft) log 2 (log i+i _i (&)). 


For any practically sized problem, the log 2 (log i 1 (fc)) factor is essentially a 
constant; even when k is chosen to be the number of particles in the observable 
universe (« 2 270 ; lEddingtonl 119231), the log 2 (log i ( k )) is ~ 18, meaning that 

_ _ 1+T 4 

for any problem of practical size, the full piecewise method is no more expensive 
than computing between 1 and 18 FFTs. 


3.2 Comparison of Low-Value p* = 8, High-value p* = 64, 
and Piecewise Method 

We first use an example max-convolution problem to compare the results from 
the low-value p* = 8, the high-value p* = 64 and piecewise methods. At every 
index, these various approximation results are compared to the exact values, as 
computed by the naive quadratic method (Figure [2a|. 


3.3 Improved Affine Piecewise Method 


Figure |2b| depicts a scatter plot of the exact result vs. the piecewise approx¬ 
imation at every index (using the same problem from Figure 2a). It shows 
a clear banding pattern: the exact and approximate results are clearly corre¬ 
lated, but each contour {i.e., each collection of indices that use a specific p*) 
has a different average slope between the exact and approximate values, with 
higher p* contours showing a generally larger slope and smaller p* contours 
showing greater spread and lower slopes. This intuitively makes sense, because 


-1 

the bounds on ||u^ m ^||c» £ [||u( m )|| p » ||«^|| P *] derived above constrain the 

scatter plot points inside a quadrilateral envelope (Figure [3]). 

The correlations within each contour can be exploited to correct biases that 
emerge for smaller p* values. In order to do this, \\u^ ||oo must be computed for 
at least two points mi and m 2 within the contour, so that a mapping || p » ss 
f{\\u( m )\\ p t) = a||u("4||p. +b can be constructed. Fortunately, a single ||tf^||oo 
can be computed exactly in O(k) (by actually computing a single u and 
computing its max, which is equivalent to computing a single index result via 
the naive quadratic method). As long as the exact value ||m^"4||oo is computed 
for only a small number of indices, the order of the runtime will not change (each 
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(a) (b) 


Figure 2: The accuracy of numerical fast max-convolution methods. 

(a) Different approximations for a sample max-convolution problem. The low-p* 
method is underflow-stable, but overestimates the result. The high-p* method 
is accurate when underflow-stable, but experiences underflow at many indices. 
The piecewise method stitches together approximations from different p* to 
maintain underflow-stability, (b) Exact vs. piecewise approximation at various 
indices of the same problem. A clear banding pattern is observed with one tight, 
elliptical cluster for each contour. The slope of the clusters deviates more for 
the contours using lower p* values. 


contour already costs 0(k log 2 (fc)), so adding a small number of 0{k ) steps for 
each contour will not change the asymptotic runtime). 

If the two indices chosen are 

m min = argmin ||w (m) || p * 

m(z:Contour(p*) 


m mSLX = argmax ||w (m) || p *, 

m€contour(p *) 

then we are guaranteed that the affine function / can be written as a con¬ 
vex combination of the exact values at those extreme points (using barycentric 
coordinates): 


/(||« (m) || P *) = A m || U ( m ““)|| 00 + (1 - A m ) ||M (m " in) ||oo 
_ h (m) || P * - ||u (m ° 1 ° ) ||p. 

m ||u( mmax )||p. — ||p» ’ 

Thus, by computing ||u^ mmin - ) ||c» and || u ( miaa *)(each in O(k) steps), we 
can compute an affine function / to correct contour-specific trends (Algo¬ 
rithm [ 3 ]). 
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Results at contour p*=8.0 



Figure 3: A single contour from the piecewise approximation. The 

cluster of points (one point for each index in the previous figure) is bounded by 
the exact value (ideal approximation) and the approximation upper-bound for 
p* = 8 (worst-case approximation). The points are well described by an affine 
function fit using the left-most and right-most points. 




Figure 4: Piecewise method with affine contour fitting. The approximate 
values at each index of the max-convolution problem are almost identical to the 
exact result at the same index. 
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Algorithm 3 Improved affine piecewise numerical max-convolution, a 

numerical method to estimate the max-convolution nonnegative vectors (fur¬ 
ther revised to reduce numerical error). This procedure uses a p* close to 
the largest possible stable value at each result index. The return value is 
a numerical estimate of the max-convolution L * max R. The runtime is in 

Q(fcl0g 2 (fc) log 2 (.Pmax)) • _ 

1: procedure numericalMaxConvolvePiecewiseAffine(L, R, p(n ax ) 

2: t max «- argmax^ L[£\ 

3: r max «- argmax,, R[r] 

4: L' 1— Jjr- -T 

Lj pmaxj 

5: R! 4— — T l> Scale to a proportional problem on L ', R' 

te\r max J 

6: allPStar <— [2°, 2\..., 2 r iog 2 (Pma X )l ] 

7: for i € {0,1, ... len{allP Star)} do 

8: resForAHPStar[i\ <— fftNonnegMaxConvolveGivenPStar(Z/, R’, 

allP Star[i]) 

9: end for 

10: for m € {0,1,... . len(L) + len(R) — 1} do 

11: maxStablePStarIndex[m] <— max{i : (resForAllPStar[i\ [ m ]) aI1PStar M > 

t)} 

12: end for 

13: result <— aff ineCorrect(resf 7 ’or AllPStar, max StableP Star Index) 

14: return L[f max ] x _R[r max ] x result !> Undo previous scaling 

15: end procedure 
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Algorithm 4 Subroutine for correcting errors in a contour, with an 
affine transformation based on exact boundary points. It needs the results of 
the evaluation of the different p-norms as well as the (index of the) maximum 
stable values of p* at every index. 

1: procedure AFFiNECoRRECT(resForAllP Star, max StableP Star Index) 

2 : Vi, slope[i ] <— 1 

3: Vi, bias[i] <— 0 

4: usedP Star «— set {max StableP Star Index) 

5: for i £ usedPStar do 

6: contour ■£- {m : max StableP Star Index[m] = i} 

7: mMin £- argmin m econtourresForAUPStar[i\[m\ 

8: mMax 4— argmaXmecontourresForAUPStar[i][m] 

9: xMin <— resForAllPStar[i\[mMin] 

10: xMax £- resForAUPStar[i][mMax] 

11: yMin <— maxConvolutionAtlndex(mMin) 

12: yMax <— maxConvolutionAtlnd ex(mMax) 

13: if xMax > xMin then 

14 - slave fil <— y M ax-yMin 

siope[i j v- xMax _ xMin 

15: bias[i] <— yMin — slope[i] x xMin 

16: else 

17: slope [i] 

18: end if 

19: end for 

20 : for m £ {0,1, ... len{L) + len{R) — 1} do 

21 : i <— maxStablePStarIndex[m) 

22 : result[m] <— resForAHPStar[i][m] x slope[i] + bias[i] 

23: end for 

24: return result 

25: end procedure 
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3.3.1 Error Analysis of Improved Affine Piecewise Method 

By exploiting the convex combination used to define /, the absolute error of 
the affine piecewise method can also be bound. Qualitatively, this is because, 
by fitting on the extrema in the contour, we are now interpolating. If the two 
points used to determine the parameters of the affine function were not chosen 
in this manner to fit the affine function, then it would be possible to choose two 
points with very close x-values (i.e., similar approximate values) and disparate 
y-values (i.e., different exact values), and extrapolating to other points could 
propagate a large slope over a large distance; using the extreme points forces 
the affine function to be a convex combination of the extrema, thereby avoiding 
this problem. 


/(IK (m V) = A m ||«( m —)|| 00 + (1 - A m ) ||u' 

|lP'"mln)|| 


IP 

|K 


A m -- i —— + (1 — A m ) 

IUP* UP* 

^rnmax 'Wrimin 

Am|K (mmax) ||p* + (1 - A m) |K (mmin) | 


c 




u 


(m min ) I 


~—b (1 — A m ) 

kp* kp * 

AJK (mmax) ||p* + (1 - A m ) ||« (mmln) ||p» 
AM (A m |K (m ”“ ) ||p* + (! - A m ) IK (mmin) ||p*) , 

A m ||u (mmax) ||p* + (1 - A m ) |K (mmin) ||p* 


fc^ll« (m) ll P -,ll« (ro) l 


The worst-case absolute error of the scaled problem on L ', R' can be defined 


- lb,Ml 


max | /(|K 


Because the function /(|K^ m ^llp*) — II 
be zero, and thus Lagrangian theory states that the maximum must occur at a 
boundary point. Therefore, the worst-case absolute error is 


is affine, it’s derivative can never 


< max{|K 


M 


- 


1! u 


Ml 


- ip 


-1 

• kp* } 




- u' 


which is identical to the worst-case error bound before applying the affine trans¬ 
formation f. Thus applying the affine transformation can dramatically improve 
error, but will not make it worse than the original worst-case. 
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3.4 Demonstration on Hidden Markov Model With 
Toeplitz Transition Matrix 

One example that profits from fast max-convolution of non-negative vectors is 
computing the Viterbi path using a hidden Markov model (HMM) (ie., the 
maximum a posteriori states) with an additive transition function satisfying 
Pr(Xj_|_i = a\Xi = b) oc S(a — b) for some arbitrary function S (d can be 
represented as a table, because we are considering all possible discrete functions). 
This additivity constraint is equivalent to the transition matrix being a “Toeplitz 
matrix”: the transition matrix T a ^ = Pr(X i+ ] = a\Xi = b) is a Toeplitz matrix 
when all cells diagonal from each other (to the upper left and lower right) have 
identical values (ie., Va,V&, T a ^ = T a+ i,b+i). Because of the Markov property 
of the chain, we only need to max-marginalize out the latent variable at time i 
to compute the distribution for the next latent variable Xi + \ and all observed 
values of the data variables Dq ... Di + \. This procedure, called the Viterbi 
algorithm, is continued inductively: 


max Pr(£>o,T)i,... A-i,-Y 0 = x 0 ,Xx = xi,...,X z = Xi ) = 

max max Pr(.D 0 , £>i,... A_ 2 ,X 0 = x 0 , Xi = x\,... ,Xj_i = Xj_i) 

Xi -1 Xo,Xi,...Xi-2 

Pi (Tlj — 1 |^z— 1 — Xi— 1) Pi (Xi — Xi | Xi—i — Xi—i') 


and continuing by exploiting the self-similarity on a smaller problem to pro¬ 
ceed inductively, revealing a max-convolution (for this specialized HMM with 
additive transitions): 

= max fromLeft(i - 1] Pr(A-i|^i-i = ah-i)^ - Xj_i] = 

Xi-1 

(fromLeft[i — 1] likelihood{Di-i]) * max 5[xi — Xi-\]. 


After computing this left-to-right pass (which consisted of n — 1 max- 
convolutions and vector multiplications), we can find the maximum a posteriori 
configuration of the latent variables X 0 ,... X n _i = Xq, ... x^_ ± backtracking 
riglit-to-left, which can be done by finding the variable value Xi that maxi¬ 
mizes fromLeft[i][xi] x 5[a;* +1 — Xi] (thus defining x* and enabling induction 
on the right-to-left pass). The right-to-left pass thus requires 0(nk) steps (Al¬ 
gorithm [5]). Note that the full max-marginal distributions on each latent 
variable Xi can be computed via a small modification, which would perform 
a more complex right-to-left pass that is nearly identical to the left-to-right 
pass, but which performs subtraction instead of addition (ie., by reversing the 
vector representation of the PMF of the subtracted argument before it is max- 
convolved; 


Serang 2014). 
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Algorithm 5 Viterbi for models with additive transitions, which ac¬ 
cepts the length k vector prior, a list of n binned observations data, a a x k 
matrix of likelihoods (where a is the number of bins used to discretize the data) 
likelihoods, and a length 2k — 1 vector <5 that describes the transition probabil¬ 
ities. The algorithm returns a Viterbi path of length n, where each element in 
the path is a valid state € {0,1,... k — 1}. 

1: procedure VlTERBlFORADDmvETRANSlTlONS(prior, data, likelihood , S) 

2: fromLeft[ 0] «— prior 

3: for i = 0 to n — 2 do 

4: fromLeft[i\ <— fromLeft\i\ x likelihood[data[i]] 

5: fromLeft[i + 1] «— fromLeft[i] * max <5 

6 : end for 

7: fromLeft[n\ <— fromLeft[n\ x likelihood[data[n ]] 

8 : 

9: path[n — 1] <— argmax^ from.Left[n — 1][j] 

10: for i = n — 2 to 0 do 

11: maxProdPosterior < - 1 

12: argmaxProdPosterior < -1 

13: for l = k to 1 do 

14: currProdPosterior «— fromLeft[i\ x <5[Z — path[i + 1]] 

15: if currProdPosterior > maxProdPosterior then 

16: maxProdPosterior currProdPosterior 

17: argmaxProdPosterior l 

18: end if 

19: end for 

20: path[i] argmaxProdPosterior 

21: end for 

22: return path 

23: end procedure 
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We apply this HMM with additive transition probabilities to a data analy¬ 
sis problem from economics. It is known for example, that the current figures 
of unemployment in a country have (among others) impact on prices of com¬ 
modities like oil. If one could predict unemployment figures before the usual 
weekly or monthly release by the responsible government bureaus, this would 
lead to an information advantage and an opportunity for short-term arbitrage. 
The close relation of economic indicators like market prices and stock market 
indices (especially of indices combining several stocks of different industries) to 
unemployment statistics can be used to tackle this problem. 

In the following demonstration of our method, we create a simple HMM 
with additive transitions and use it to infer the maximum a posteriori unem¬ 
ployment statistics given past history (i.e. how often unemployment is low 
and high, as well as how often unemployment goes down or up in a short 
amount of time) and current stock market prices (the observed data). We 
discretized random variables for the observed data (S&P 500, adjusted closing 
prices ; retrieved from YAHOO! historical stock prices: http: //data. bis. gov/ 
cgi-bin/surveymost?blsseriesCUUROOOOSAO), and ’’latent” variables (unem¬ 
ployment insurance claims, seasonally adjusted, were retrieved from the U.S. De¬ 
partment of Labor: https://www.oui.doleta.gov/unemploy/claims.asp). 
Stock prices were additionally inflation adjusted by (i.e. divided by) the con¬ 
sumer price index (CPI) (retrieved from the U.S. Bureau of Labor Statistics: 
https : //finance .yahoo. com/q?s=~GSPC). The intersection of both ’’latent” 
and observed data was available weekly from week 4 in 1967 to week 52 in 2014, 
resulting in 2500 data points for each type of variable. 

To investigate the influence of overfitting, we partition the data in two parts, 
before June 2005 and after June 2005, so that we are effectively training on 
20 2500 o° = 80% the data points, and then demonstrate the Viterbi path 
on the entirety of the data (both the 80% training data and the 20% of the 
data withheld from empirical parameter estimation). Unemployment insurance 
claims were discretized into 512 and stock prices were discretized into 128 bins. 
Simple empirical models of the prior distribution for unemployment, the like¬ 
lihood of unemployment given stock prices, and the transition probability of 
unemployment were built as follows: The initial or prior distribution for un¬ 
employment claims at i = 0 was calculated by marginalizing the time series of 
training data for the claims (i.e. counting the number of times any particular 
unemployment value was reached over all possible bins). Our transition func¬ 
tion (the conditional probability Pr(X i+1 |A,)) similarly counts the number of 
times each possible change X i+i — X i: G {—511, —510,... 511} occurred over all 
available time points. Interestingly, the resulting transition distribution roughly 
resembles a Gaussian (but is not an exact Gaussian). This underscores a great 
quality of working with discrete distributions: while continuous distributions 
may have closed-forms for max-convolution (which can be computed quickly), 
discrete distributions have the distinct advantage that they can accurately ap¬ 
proximate any smooth distribution. Lastly, the likelihoods of observing a stock 
price given the unemployment at the same time were trained using an empirical 
joint distribution (essentially a heatmap), which is displayed in Figure [H] 
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Figure 5: Heatmap for trained likelihood matrix. This heatmap depicts a 
joint empirical distribution between the S&P 500 index and new unemployment 
claims, which share a tenuous inverse relationship. Given Di, the discretized 
stock index value at time i, row Di contains the likelihood table Pr(Dj|Xj), 
which is denoted likelihood[data[i]\ in the code. 
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We compute the Viterbi path two times: First we use naive, exact max- 
convolution, which requires a total of 0(nk 2 ) steps. Second, we use fast nu¬ 
merical max-convolution, which requires 0(n k log(fc) log(log(fc)) steps. Despite 
the simplicity of the model, the exact Viterbi path (computed via exact max- 
convolution) is highly informative for predicting the value of unemployment, 
even for the 20% of the data that were not used to estimate the empirical prior, 
likelihood, and transition distributions. Also, the numerical max-convolution 
method is nearly identical to the exact max-convolution method at every index 
(Figure [6]). Even with a fairly rough discretization (ie., k = 512), the fast nu¬ 
merical method used 141.4 seconds compared to the 292.3 seconds required by 
the naive approach. This speedup will increase dramatically as k is increased, 
because the log(log(fc)) term in the runtime of the numerical max-convolution 
method is essentially bounded above log(log(fc)) < 18. 

3.5 An Improved Approximation of the Chebyshev Norm 

Although the p*-norm provides a good approximation of the Chebyshev norm, 
it discards significant information; specifically the curve for various p* 

could be used to identify and correct the worst-case scenario where j| M "i)|| = 

(1,1,... 1); using only two points, the exact value of ||u^ m ^||oo can be computed 
for those worst-case u( m ' > vectors by computing the norms at two different p* 
values and solving the following equations for /3i: 

ll« (m) llg « $ 

Wu^Wll (X 

where the proportionality constant is k m = len(u^ m ^) and where the computed 
value /3i yields the exact Chebyshev norm ||w^ m ^||oo. 

3.5.1 A Projection-Based Method for Estimating ||it( m )||oo 

More generally, when there are e m < k m unique values (/?*) in vf' m \ we can 
model the norms perfectly with 

ii« (m) ii£ = £m?‘ 

i 

where hi is an integer that indicates the number of times occurs in (and 
where JV hi = k m = len(u^ m ^)). This multi-set view of the vector u^ can be 
used to project it down to a dimension r: 
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Figure 6: Viterbi analysis of employment given stock index values. 

The Viterbi path corresponding to the maximum a posteriori prediction of the 
number of new unemployment insurance claims is produced for a model where 
the state transition probabilities are additive. The exact Viterbi estimate tracks 
well with the true unemployment values. Training parameters were taken from 
only the true unemployment data to the left of the vertical dotted line; however, 
the Viterbi paths to the right of the dotted line (where unemployment data were 
withheld from the likelihood, prior, and transition parameters) also track well 
with the true unemployment statistics. The Viterbi path computed with fast 
numerical max-convolution (via the affine piecewise approach) is nearly identical 
to the result computed with the slower exact approach. 
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By solving the above system of equations for all a,, the maximum a = max,; a* 
can be used to approximate the true maximum max.; /?; = Hoo. This projec¬ 
tion can be thought of as querying distinct moments of the distribution pmf t ;< m ) 
that corresponds to some unknown vector u^ m \ and then assembling the mo¬ 
ments into a model in order to predict the unknown maximum value in u^ m \ 
Of course, when r, the number of terms in our model, is sufficiently large, then 
computing r norms of u'- m will result in an exact result, but it could result 
in 0{km) execution time, meaning that our numerical max-convolution algo¬ 
rithm becomes quadratic; therefore, we must consider that a small number of 
distinct moments are queried in order to estimate the maximum value in u. 
Regardless, the system of equations above is quite difficult to solve directly via 
elimination for even very small values of r, because the symbolic expressions 
become quite large and because symbolic polynomial roots cannot be reliably 
computed when the degree of the polynomial is > 5. Even in cases when it can 
be solved directly, it will be far too inefficient. 

For this reason, we solve for the a; values using an exact, alternative ap¬ 
proach: If we define a polynomial j(x) = (x — of ''j [x — ^ • • • (x — a? ), 

then x £ { a\ , a% ,... a? } 4$ j(x) = 0. We can expand j(x) = 7o + JiX + 
J 2 X 2 + ■ ■ ■ + ”f r x r , and then write 
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which indicates that 


7o 7l 72 • • • 7 r 


ll« (m) ll£ 

\\n^\\% 

II"™ Ha? 


= 0. 



Furthermore, 7(2:) = 0, x ^ 0 <t 7 x l j(x ) =0,i€N; therefore we can write 
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Because the columns of 
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must be linearly independent when on, 0% ... are distinct (which is the case by 
the definition of our multiset formulation of the norm), then r = | will determine 
a unique solution; thus the null space above is computed from a matrix with r+1 
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columns and r rows, yielding a single vector for (70,71, • ■ • J r )- This vector can 
then be used to compute the roots of the polynomial 7 o + 7iX+72X 2 + - • • +'y r x r , 
which will determine the values {o^ , af ,a? }, which can each be taken to 
the T- power to compute {ai, c*2, • • •, a r }; the largest of those cq values is used 


as the estimate of the maximum element in u^ m \ When contains at least 


r distinct values ( i.e., e m > r), then the problem will be well-defined; thus, if 
the roots of the null space spanning vector are not well-defined, then a smaller r 
can be used (and should be able to compute an exact estimate of the maximum, 
since can be projected exactly when r is the precise number of unique 
elements found in u ( m l). 


Note that this projection method is valid for any sequence of norms with 

l p “ +p * lk( m )|| po+2p * lM m )|| po+3p * \\u^\\ P0+ T 

|po+p*’ll u llpo+2p* > II “ llp 0 +3p* ’ ' ' ' II u \\po+tp*' 


even spacing: ||id 


3.5.2 Closed-Form Projection Method for r = 2 


In general, the computation of both the null space spanning vector (70,71,... 7 r ) 
and of machine-precision approximations for the roots of the polynomial 70 + 
712 + 72a: 2 + • • • + "/ r x r (which can be approximated by constructing a matrix 
with that characteristic polynomial and performing eigendecomposition (Horn 
and Johnson ( 1999 )) are both in 0 (r 3 ) for each index m in the result; however, 
by using a small r = 2, we can compute a closed form solution of both the 
null space spanning vector and of the resulting quadratic roots. This enables 
faster exploitation of the curve of norms for estimating the maximum value of 
u (m) (although it doesn’t achieve the high accuracy possible with a much larger 
r « e). This is equivalent to approximating ||u^ m ^||p» ss h\a\ + /12a2 , where 
h\ + hi = k m = len^u^). 

In this case, the single spanning vector of the null space of 


will be 


ii« (m) n p : 

ii^m; 




7o 


7i 
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_ 72 _ 



ii« (m) nS wu^wz: 




and thus a « ||w^ m ^|| 00 can be computed by using the quadratic formula to 

solve 70 + 7i^ + 72a; 2 = 0 for x, and computing a using the maximum of those 
1 

zeros: a = i max p r . When the quadratic is not well defined, then this indicates 
that the number of unique elements in v> m ’ is less than 2, and thus cannot be 
projected uniquely (i.e., e m < r); in this case, the closed-form linear solution 
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can be used rather than a closed-form quadratic solution: 



When the closed-form linear solution is not numerically stable (due to division 
by a value close to zero), then the p*-norm approximation can likewise be used. 

3.5.3 Adapted Piecewise Algorithm Using Interleaved p* Points 

Because the norms must have evenly spaced p* values in order to use the projec¬ 
tion method described above, the exponential sequence of p* values used in the 
original piecewise algorithm will not contain four evenly spaced points (which 
are necessary to solve the quadratic formulation, i.e. r = 2). One possible 
solution would be to take the maximal stable value of p* for any index (which 
will be a power of two found using the original piecewise method), and then 
also computing norms (via the FFT, as before) for p* — 35, p* — 25, p* — 5, p*; 
however, this will result in a 4x slowdown in the algorithm, because for every 
p*-norm computed via FFT before, now four must be computed. An alternative 
approach reuses existing values in the 2* sequence of p*: for p* sufficiently large, 
then the exponential sequence is guaranteed to include these stable p* values: 
£j-, \,p*■ By considering in p* candidates, then we can be guaranteed to 
have four evenly spaced and stable p* values. This can be achieved easily by 
noting that 

3 £+P * 

4 2 ’ 

meaning that we can insert all possible necessary p* values for evenly spaced 
sequences of length four by first computing the exponential sequence of p* 
values and then inserting the averages between every pair of adjacent pow¬ 
ers of two (and inserting them in a way that maintains the sorted order): 
1,2,4,8,16,... becomes 1,1.5,2,3,4,6,8,12,16,.... Thus, if (for some index 
m) 16 is the highest stable p* that is a power of two (i.e., the p* value that 
would be used by the original piecewise algorithm), then we are guaranteed 
to use the evenly spaced sequence 4, 8,12,16. By interleaving the powers of 
two with the averages from the following powers of two, we reduce the number 
of FFTs to 2x that used by the original piecewise algorithm. For small val¬ 
ues of r (such as the r = 2 used here), the estimation of the maximum from 
each sequence of four norms is in 0(4k), meaning the total time will still be 
fclog(fc) log(log(/c) +4 k £ 0(k\og(k) log(log(fc))), which is the same as before. 
Because the spacing in this formulation is and given the maximal root of 

4 

the quadratic polynomial 7 (x max ) = 0, then a = re max (taking the maximal 
root £ max to the power -4 instead of -V, which had been the spacing used in 
the description of the projection method). The null space projection method is 
shown in Algorithm [6] 
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Algorithm 6 Piecewise numerical max-convolution with projection, a 

numerical method to estimate the max-convolution of two PMFs or nonnegative 
vectors. This method uses a nullspace projection to achieve a closer estimate 
of the true maximum. Depending on the number of stable estimates, linear or 
quadratic projection is used. The parameters are two nonnegative vectors L' 
and R! (both scaled so that they have maximal element 1). The return value is 
a numerical estimate of the max-convolution L' * max R' • 

1: procedure numericalMaxConvolvePiecewiseProjectionAffine(L', R', p*) 

2: tmax <- argmax^ L[£\ 

3: r max <- argmax r R[r] 

4: L ' WFFT] 

5: R' «— 11 — T t> Scale to a proportional problem on L' R' 

Tt[r max J 

6: allPStar <- [2“\2°, 2 1 ,..., 2 + 2l log2(p »“ ) J ] 

7: for h £ {0,1,... len(allPStar)} do 

8: allPStarInterleaved[2i ] «— allPStar[i] 

9: allPStar Interleaved^ + 1] <— 0.5 x ( allPStar[i] + allPStar[i + 1]) 

10: end for 

11: for i £ {0,1,... len{allPStar)} do 

12: resForAHPStar[i\ <— fftNonnegMaxConvolveGivenPStar(Z/, R', 

allPStarInterleaved[i ]) 

13: end for 

14: for m £ {0,1,... len(L) + len(R) — 1} do 

15: maxStablePStarIndex[m\ <— max{i : (resForAllP£tar[i][m]) allPStarInterleaved ^ > 

t)} 

16: end for 

17: for o £ {0,1,... len{maxStableP Star Index)} do 

18: maxStablePStarIndex[d\— = maxStablePStarIndex[o ]%2 > Restrict to 

powers of 2 

19: end for 

20: for p £ {0,1,... len(maxStableP Star Index)} do 

21: maxP allPStarInterleaved[maxStablePStarIndex[p\] 

22: spacing <— 0.25 * maxP 

23: est± <— resForAHPStar[maxStablePStarIndex[p]\ 

24: ests resForAHPStar[maxStablePStarIndex[p] — 1] 

25: if maxStablePStarIndex[p\ < 5 then > Need 5 p* in sequence to get 4 

evenly spaced 

26: resForAUPStar\p\ <— maxLin(esf 3 , est^) 

27: else 

28: est 2 <— resForAllPStar[maxStablePStarIndex\p\ — 2] 

29: esti <— resForAllPStar[maxStablePStarIndex[p\ — 4] > Index - 4 is 

the next evenly spaced point 

30: resForAllPStar\p\ -(— maxQuad(esti, est 2 , estz, esti, spacing) 

31: end if 

32: end for 

33: result aff ineCorrect (resForAllPStar, maxStablePStarlndex) 

34: return L[f ma x] x R[r max ] x result o Undo previous scaling 

35: end procedure 
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Algorithm 7 Linear projection of the maximum, using previously com¬ 
puted values est 3 ,est 4 for two p* with a difference of spacing (estimates given 
in ascending order of their corresponding p’s used). The naming of the variables 

follows the scheme esti = ||u( m )|| To prevent numeric instabilities, the 

algorithm checks for division by zero within a tolerance rpiv — 1 CP 10 (again, 
a conservative estimate of the machine precision). The return value is a new 

estimate of the real maximum. _ 

1 : procedure MAxLiN(est3, esti, spacing) 

2: if |ests| > TDiv then 

3: result «- 

esi 3 

4: else 

5: result <— esti 

6 : end if 

7: return result^ 0/spacin9) 

8 : end procedure 


Algorithm 8 Quadratic projection of the maximum, using previously 
computed estimates est\,est 2 ,est 3 ,est 4 for four equally spaced p in steps of 
spacing (estimates given in ascending order of their corresponding p’s used). 

The naming of the variables follows the scheme esti = |||| To prevent 

numeric instabilities, the algorithm checks for division by zero within a tolerance 
tD iv = 10- 10 . The return value is a new estimate of the real maximum. 

1: procedure MAxQuAD(esti, esfe, estz, esti, spacing) 

2 : 72 <— esti * est 3 — est| 

3: 71 <— est 2 * est 3 — esti * esti 

4 : 70 <— est 2 * esti — est| 

5: preRootValue <— 71 — 4 * 72 * 70 

6 : stableQuadratic (70 > tda) & (preRootValue >= 0.0) 

7: if stableQuadratic then 

8 : result <— (—71 + \/preRootValue/ (2 * 72 ) 

9: else t> Resort to linear projection 

10: result <— maxLin(est 3 , esti) 

11 : end if 

12 : return result^' 0/spacin9) 

13: end procedure 
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3.5.4 Accuracy of the r — 2 Projection-Based Method 

The full closed-form of the quadratic roots used above (which solve the projec¬ 
tion when r = 2) will be 


/v — m tsv 
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where p* = ma * p in the pseudocode ( i.e ., the maximum numerically stable 
p* used by the piecewise algorithm at that index). Note that ||u^ m ^||So can 
be factored out because the exponents in every term in the numerator will be 
5 p* {i.e., lOp* in the square root). Similarly the terms in the denominator 
each contain . Factoring out the maximum value is then the same as 

operating on scaled vectors v (instead of u) with the maximum entry being 1, 
and at least one element of value 1. 

Furthermore, the denominator 272 > 0 ; even though the terms summed to 
compute 72 are not exclusively nonnegative, symmetry can be used to demon- 
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strate that every negative term is outweighed by a unique corresponding term: 


72 = 


> 




(m) (r, 

V) V\ 


— V) 


Thus, for well-defined problems (be., when 72 ^ 0 ), the denominator 272 > 0 , 
and therefore, the maximum root of the quadratic polynomial will correspond 
to the term that adds (rather than subtracts) the square root term: 



The relative absolute error is defined as | a ||„o^)|| ^°° I = I Wu^w^ - 11 i there¬ 

fore, a bound on the relative error of the projection method can be established 
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by bounding 



where the length of the iv- m ’ (respectively v^ m ’) is k m . Using this reformulation, 
s = 1 indicates a zero-error approximation. This can be rewritten to bound its 
value before taking to the power ^: 

s(p*, km) = t(p*,k m )** 


where 



The extreme values of t(p*, k m ) can be found by minimizing and maximizing 
over the possible values of € V = {v : [ 0 , l\ m : Eli, w* = 1 , 3 j,vj € ( 0 , 1 )}. 
The final constraint on Vj in ( 0 , 1 ) is because any v containing only one unique 
value (which must be 1 in this case since dividing by the maximum element in 
to compute v^ has divided the value at that index by itself ( 3 i,vi = 1) 
will lead to instabilities. When values in v are identical to one another, using 
r = 1 yields an exact solution, and thus solving with r = 2 is not well-defined 
because 72 = 0 . Because all elements v^ P € [ 0 , 1 ] and p* > 1 , we can perform 
a change of variables v^ = v , thereby eliminating references to p*: 
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h 

rv m 

3 

4 

5 

6 

7 

Minimum 

0.935537 

0.902161 

0.895671 

0.880487 

0.85343 

Maximum 

1 

1 

1 

1 

1 


Table 1: Exact bounds of t{k m ) for short vectors of length k rn . This 
table shows the results of numerical minimization techniques performed on the 
symbolic closed-form of t(k m ) in Mathematica (using Minimize). All k m — 1 
entries (excluding the first that was set to 1 . 0 ) were left symbolic and constrained 
to [ 0 , 1 ], with restriction that the denominator of t(k m ) was nonzero. 


t(k m ) < 


max (||'i? (m) H2 ||n (m) ||g ll^ (m) llill^ Cm) ll4 

:3i,Vi=l,3j,Vj E(0,1) V 

+ ((Ik (m) ||ill« (m) |l3 - l|v (m) ||lll^ (m) ||4) 2 

- 4 (||n( m )||i||n( m )||l - ||nM||f )(||nM||2|| t; M||4 _ [^MyS 2 )) 
-2(||^)|| 22 -||nM||i||n( m )||i). 



For small vector lengths, the exact bounds of t(k m ) are shown in Table [l] 
Notice that the upper bound is fixed, but the lower bound grows monotonically 
smaller as k m , the length of the vector considered, increases. For larger vectors, 
Mathematica does not find optima in a matter of hours, and for arbitrary- 
length vectors, the Karush-Kuhn-Tucker criteria do not easily yield minima or 
maxima; however, we do observe that all maxima are achieved by vectors that 
are permutations (order does not influence the result) of v = (1, 1 ,... 1 , b, b, .. . b) 
(again, when only two unique values are found in v, the approximation is exact 
and thus — = 1). Likewise, the minima are achieved by permutations of 
v = ( 1 , a, b, b ,... b). For this reason, we perform further empirical estimation of 
the bound by randomly sampling vectors of the form ( 1 , U 2 , V 3 ,... Vk m ) with k m — 
1 degrees of freedom (d.o.f.) and sampling vectors of the form v = ( 1 , a,b,b,... b ) 
(with 2 d.o.f.), whose extrema are shown in Table [ 2 ] 

At length 64 we see that due to an extreme value scenario, an uncon¬ 
strained vector scores slightly lower than a vector holding the worst-case pattern 
(l,a, b,.. . 6 ), because both forms of sampling approach the true lower bound, 
but one of the unconstrained k m — 2 d.o.f. is slightly closer. 

From these results, we conjecture that t is bounded above < 1 (this is achiev¬ 
able at any length k m by letting v contain exactly two distinct values). In this 
manner, we achieve our predicted upper bound of 1 regardless of the length k m - 
Likewise, we conjecture that at any k m (not simply the lengths investigated, 
where this principle is true), the lower bound is given by vectors of the form 
(1, a, b, b,... b). Qualitatively, this conjecture stems from the fact that since 
the estimate is perfect when v contains exactly two distinct elements, then the 
worst-case lower bound when v contains three distinct values will concentrate 
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h 

n "m 

4 

64 

1024 

Minimum ( k m — 1 d.o.f.) 
Maximum (fc m — 1 d.o.f.) 

0.90221268 

0.99999986 

0.74942834 

0.92482416 

0.81858283 

0.86795636 

Minimum (vectors of form (1, a, 6 ,... 
Maximum (vectors of form (1, a, 6 ,... 

. 6 ), 2 d.o.f.) 
b), 2 d.o.f.) 

0.90216688 

1.00000000 

0.75455478 

1.0000000 

0.71695386 

1.00000000 


Table 2: Bounds via random sampling for vectors different in size and 
type. This table shows the minimal and maximal values resulting from the 
evaluation of t(k m ) on 10 5 randomly generated vectors (uniform distribution in 
[0,1]). The first part shows the result for vectors of potentially unconstrained 
composition, besides one (w.l.o.g. the first) being set to 1.0. The values in the 
second half were obtained based on vectors of (supposedly) worst-case compo¬ 
sition (i.e. of form ( 1 , a, 6 ,... b )). 


the points at some value far from the other two distinct values. When four 
distinct values are permitted, then we conjecture that the optimal choice (for 
minimizing t) of the fourth value will equal the choice for the third distinct 
value, since that was already determined to be the best point for deceiving the 
quadratic approximation. From this conjecture, we can then use the fact that 
the bounds should only grow more extreme as k m increases, since I 1 C R 2 C • • • 

(i.e. lower-dimensional solutions can always be reached in a higher dimension 
by setting some of the values to 0). Thus the minimum for any possible vector 
should be conservatively bounded below on vectors of the form ( 1 , a, b, b ,... b) 
and is achieved by letting k m approach oo: 

lim t(k m ) = 

km—> OO 

a 4 b — a 3 b 2 — a 2 b 3 + \Jb' 2 (—a 4 + 3 a 3 b — 3a 2 b 2 + ab 3 + (b — l ) 3 ) 2 + ab 4 +b 4 — b 3 — b 2 + b 

2b ( a 3 — 2 a 2 b + ab 2 + (6 — l) 2 ) 

The minimum value of this expression over all a £ [0,1], b £ [0, f] is 0.704 (com¬ 
puted again with Mathematica). Overall, assuming our conjecture regarding 
the forms of the vectors achieving the minima and maxima, then it follows that 
t £ (0.7,1], and the worst-case relative error at the p^ ax contour will be bounded 

1 4 

|£Pmax — 11 <C 1 — 0.7 p max . 

The steeper decrease in relative error as p* is increased means that the same 
procedure can be used to achieve an absolute error bound: 

a - (i “ 0 . 7 ?") , 
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which achieves a unique maximum at 


1.4267 *log(r) -4.07094 


14.52. 


Pmode 


(log(r) - 2.8534) (log(l - )) 


As before, the worst-case absolute error of the unsealed problem will be found 
by simply scaling the absolute error at p^ 0(je : 



Because p^ 0(ie (the value of p* producing the worst-case absolute error) for the 
null space projection method it is invariant to the length of the list k (enabling 
us to compute a numeric value), and because its numeric value is so small, even a 
fairly small choice ofp^.^ will suffice (nowpJ, ax £ 0(1) rather than in 0(log(fc)) 
as it was with the original piecewise method). For example, the approximation 
of the Viterbi path to infer the unemployment data is slightly superior with the 
null space projection method, even when p* nax = 64 is used (in contrast to the 
Pmax = 8192 used in the Figure [6]). The null space projection method required 
136.6 seconds (slightly faster than the 141.4 seconds required by the original 
piecewise method). 

The one caveat of this worst-case absolute error bound is that it presumes 
at least four evenly spaced, stable p* can be found (which may not be the case 
by choosing p* from the sequence 2 l in cases when llu^Hoo ~ 0); however, 
assuming standard fast convolution can be performed (a reasonable assumption 
given it is one of the essential numeric algorithms), then four evenly spaced p* 
values could be chosen very close to 1; therefore, these values of p* could be 
added to the sequence so that the algorithm is slightly slower, but essentially 
always yield this worst-case absolute error bound. 

In practice, we can demonstrate that the null space projection method is 
very accurate. First we show the impact of using the quadratic (i.e., r = 2) 
projection method on unsealed single u^ vectors. The projection method 
was tested on vectors of different lengths drawn from different types of Beta 
distributions and are compared with the results of the p-norms with the highest 
stable p (Figure [7]). The relative errors between the original piecewise method 
and the null space projection method are compared using a max-convolution on 
two randomly created input PMFs of lengths 1024 (Figure [8]). Note that the 
null space projection can also be paired with affine scaling on the back end, just 
as the original piecewise method can be. In practice, the null space projection 
increases the accuracy demonstrably on a variety of different problems, although 
the original piecewise method also performs well. 

Although the worst-case runtime of the null space projection method is 
roughly 2x that of the original piecewise method, the error bound no longer 
depends on the length of the result k. Thus, for a given relative error bound 
on the top contour (?.e., the equivalent of the derivation of p ^ ax in the original 
piecewise algorithm), the value of p^ ax is fixed and no longer £ 0(log(/c)). For 
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k 

2 6 

2 7 

2 s 

2 9 

2 10 

2 11 

2 12 


Naive 

0.0142 

0.0530 

0.192 

0.767 

3.03 

12.1 

48.2 

Naive (vectorized) 

0.0175 

0.0381 

0.0908 

0.251 

0.790 

2.75 

10.1 

FILL1 (Bussieck et al. 

(1994 

)) 

0.0866 

1.09 

7.21 

19.4 

457 

— 

— 

Max. stable p*, affine corrected 

0.0277 

0.0353 

0.0533 

0.0848 

0.149 

0.274 

0.537 

Projection, affine corrected 

0.0236 

0.0307 

0.0467 

0.0760 

0.137 

0.258 

0.520 


Table 3: Runtimes of different methods for max-convolution on uni¬ 
form vectors of length k. The runtimes were gathered using the timeit pack¬ 
age in Python. They include all preprocessing steps necessary for the algorithm 
( e.g. sorting prior to the FILL1 approach). The values are total runtimes (in 
seconds) to run 5 repetitions on different, randomly generated vectors. FILL1 
was not run on larger problems, because it ran substantially longer than the 
non-vectorized naive approach. On the two approximation methods presented 
in this manuscript, the highest stable p*-norm approximation was run with the 
heuristically chosen p^x f° r problems of the appropriate size and the null space 
projection was run with p * iax = 64. 


example, achieving a 0.5% relative error in the top contour would require 


4 

1 — 0.7 p max 


< 0.005 —> p 


* 

max 


. log(0-7) 

~ log(0.995) 


284.62, 


meaning that choosing p* lax = 512 would achieve a very high accuracy, but while 
only performing 2x9 FFTs. For very large vectors, this will not be substantially 
more expensive than the original piecewise algorithm, which uses a higher value 
of Pm ax (in this case, = logj 005 (A:), which continues to grow as k does) 
to keep the error lower in practice. As a result, the runtime of the null space 
projection approximation is £ 0(klog(k)) rather than 0{k log(fc) log(log(fc))), 
despite the similar runtime in practice to the original piecewise method (the 
null space projection method uses 2x as many FFTs performed per p* value, 
but requires slightly fewer p* values). 


3.5.5 Practical runtime comparison 

To compare the actual runtimes of the final algorithm developed in this 
manuscript with a naive max-convolution and a previously proposed method 
from Bussieck et al. (1994), all methods were run on vectors of different random 
(uniform in [0,1]) composition and length ( k ). The first and second input vector 
were generated seperately but are always of same length. Table [3]shows the re¬ 
sult of this experiment. All methods were implemented in Python, using numpy 
where applicable (e.g. to vectorize). A non-vectorized version of naive max- 
convolution was included to estimate the effects of vectorization. The approach 
from Bussieck et al. ran as a reimplementation based on the pseudocode in 
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Relative errors on Beta distributions 



-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -12 -10 -8 -6 -4 -2 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 


[ Highest stable p* Null space projection |j 

Figure 7: Relative errors on random vectors with and without null 
space projection. For the two approximation methods (using the highest sta¬ 
ble p*-norm with the heuristically chosen or using the null space projection 
with = 64), vectors of different lengths are sampled (2 12 repetitions) from 
a variety of Beta distributions. The settings for the parameters (a, /J) of the 
Beta distribution that were used, as well as the lengths of the generated vectors 
are shown in the titles of the subplots: a = 0.5,/3 = 0.5 (bimodal with modes 
near zero and one); a = 0.1, (3 = 0.1 (uniform distribution); a — 10, /3 = 0.25 
(with a strong mode near one). The red area depicts the frequencies (y-axis) of 
the different magnitudes of (relative) error (x-axis) when using the highest sta¬ 
ble p*-norm is used as an approximation of the Chebyshev norm (p = oo). The 
blue area shows the errors with the method that performs a projection (either 
quadratic or linear depending on how many numerically stable p* are available) 
to estimate the Chebyshev norm. 
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Figure 8: Relative errors on large max-convolution with and without 
null space projection. Max-convolution between two randomly generated 
vectors (both uniform vectors convolved with narrow Gaussians with uniform 
noise added afterward), performed with the highest stable p*-norm (using the 
heuristic choice of p* nax for problems of this size) and with null space projection 
(using = 64). The left y-axis shows the relative error at index m. Asso¬ 
ciated with that, you can see the red and blue curve depicting the errors from 
the two different methods: Red describes the max-norm estimation using only 
the highest stable p* while purple was generated using quadratic projection at 
the four highest stable p* values (when at least four evenly spaced values are 
numerically stable) and linear projection at the two highest stable p* values 
(when only two p* are numerically stable). The results of both approaches are 
corrected with the affine transformation method proposed in this manuscript. 
In the background the gray shaded curve shows the exact result of the max- 
convolution at every index (to be used with the second y-axis on the right). 
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their manuscript. From their variants of proposed methods, FILL1 was chosen 
because of its use in their corresponding benchmark and its recommendation by 
the authors for having a lower runtime constant in practice compared to other 
methods they proposed. The method is based on sorting the input vectors and 
traversing the (implicitly) resulting partially ordered matrix of products in a 
way that not all entries need to be evaluated, while only keeping track of the 
so-called cover of maximal elements. FILL1 already includes some more sophis¬ 
ticated checks to keep the cover small and thereby reducing the overhead per 
iteration. Unfortunately, although we observed that the FILL1 method requires 
between 0(n log(n)) and 0(n 2 ) iterations in practice, this per-iteration overhead 
results in a worst-case cost of log(n) per iteration, yielding an overall runtime 
in practice between 0(n log(ra) log(n)) and 0{n 2 log(ra)). As the authors state, 
this overhead is due to the expense of storing the cover, which can be imple¬ 
mented e.g. using a binary heap (recommended by the authors and used in this 
reimplementation). Additionally, due to the fairly sophisticated datastructures 
needed for this algorithm it had a higher runtime constant than the other meth¬ 
ods presented here, and furthermore we saw no means to vectorize it to improve 
the efficiency. For this reason, it is not truly fair to compare the raw runtimes 
to the other vectorized algorithms (and it is not likely that this Python reim¬ 
plementation is as efficient as the original version, which Bussieck et al. (19941 
implemented in ANSI-C); however, comparing a non-vectorized implementation 
of the naive 0(n 2 ) approach with its vectorized counterpart gives an estimated 
~ 5x speedup from vectorization, suggesting that it is not substantially faster 
than the naive approach on these problems (it should be noted that whereas 
the methods presented here have tight runtime bounds but produce approx¬ 
imate results, the FILL1 algorithm is exact, but its runtime depends on the 
data processed). During investigation of these runtimes, we found that on the 
given problems, the proposed average case of 0(?rlog(n)) iterations was rarely 
reached. A reason might be an unrecognized violation of the assumptions of the 
theory behind this theoretical average runtime in how the input vectors were 
generated. 

In contrast to the exact method from Bussieck et al. (1994), the herein pro¬ 
posed approximate procedure are faster whenever the input vectors are at least 
128 elements long (shorter vectors are most efficiently processed with the naive 
approach). The null space projection method is the fastest method presented 
here (because it can use a lower p^ ax ), although the higher density of p* values 
it uses (and thus, additional FFTs) make the runtimes nearly identical for both 
approximation methods. 


4 Discussion 


Both piecewise numerical max-convolution methods are highly accurate in prac¬ 
tice and achieve a substantial speedup over both the naive approach and 
the approach proposed by Bussieck et al. (1994). This is particularly true 
for large problems: For the original piecewise method presented here, the 
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log 2 (log i i(fc)) multiplier may never be small, but it grows so slowly with 
k that it will be < 18 even when k is on the same order of magnitude as the 
number of particles in the observable universe. This means that, for all practical 
purposes, the method behaves asymptotically as a slightly slower 0(A; log 2 (fc)) 
method, which means the speedup relative to the naive method becomes more 
pronounced as k becomes large. For the second method presented (the null space 
projection), the runtime for a given relative error bound will be in 0(fclog 2 (fc)). 
In practice, both methods have similar runtime on large problems. 

The basic motivation of the first approach described- i.e., the idea of approx¬ 
imating the Chebyshev norm with the largest p*-norm that can be computed 
accurately, and then convolving according to this norm using FFT also sug¬ 
gests further possible avenues of research. For instance, it may be possible to 
compute a single FFT (rather than an FFT at each of several contours) on a 
more precise implementation of complex numbers. Such an implementation of 
complex values could store not only the real and imaginary components, but 
also other much smaller real and imaginary components that have been accu¬ 
mulated through + operations, even those which have small enough magnitudes 
that they are dwarfed by other summands. With such an approach it would 
be possible to numerically approximate the max-convolution result in the same 
overall runtime as long as only a bounded “history” of such summands was 
recorded (i.e., if the top few magnitude summands—whether that be the top 7 
or the top log 2 (log 1 (k ))—was stored and operated on). In a similar vein, it 
would be interesting to investigate the utility of complex values that use rational 
numbers (rather than fixed-precision floating point values), which will be highly 
precise, but will increase in precision (and therefore, computational complexity 
of each arithmetic operation) as the dynamic range between the smallest and 
largest nonzero values in L and R increases (because taking L' to a large power 
p* may produce a very small value). Other simpler improvements could include 
optimizing the error vs. runtime trade-off between the log-base of the contour 
search: the method currently searches log 2 (Pm ax ) contours, but a smaller or 
larger log-base could be used in order to optimize the trade-off between error 
and runtime. 

It is likely that the best trade-off will occur by performing the fast p*-norm 
convolution with a number type that sums values over vast dynamic ranges 
by appending them in a short (i.e., bounded or constant size) list or tree and 
sums values within the same dynamic range by querying the list or tree and 
then summing in at the appropriate magnitude. This is reminiscent of the fast 
multipole algorithm (Rokhlin, 1985). This would permit the method to use a 
single large p* rather than a piecewise approach, by moving the complexity into 
operations on a single number rather than by performing multiple FFTs with 
simple floating-point numbers. 

The basic motivation of the second approach described- i.e., using the se¬ 
quence of p*-norms (each computed via FFT) to estimate the maximum value— 
generalizes the p*-norm fast convolution numerical approach into an interesting 
theoretical problem in its own right: given an oracle that delivers a small num- 
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ber of norms (the number of norms retrieved must be c £ o(k) to significantly 
outperform the naive quadratic approach) about each vector u ( m ' > , amalgamate 
these norms in an efficient manner to estimate the maximum value in each u^ ■ 
This method may be applicable to other problems, such as databases where the 
maximum values of some combinatorial operation (in this case the maximum a 
posteriori distribution of the sum of two random variables X + Y) is desired but 
where caching all possible queries and their maxima would be time or space pro¬ 
hibitive. In a manner reminiscent of how we employ FFT, it may be possible to 
retrieve moments of the result of some combinatoric combination between distri¬ 
butions on the fly, and then use these moments to approximate true maximum 
(or, in general, other sought quantities describing the distribution of interest). 

In practice, the worst-case relative error of our quadratic approximation 
is quite low. For example, when p* = 8 is stable, then the relative error is 
less than 2.3%, regardless of the lengths of the vectors being max-convolved. 
In contrast, the worst-case relative error using the original piecewise method 
would be < fcis — 1, where k is the length of the max-convolution result (when 
n — 1024, the relative error of the original piecewise method would be « 54%). 

Of course, the use of the null space projection method is predicated on the 
existence of at least four sequential p* points, but it would be possible to use 
finer spacing between p* values ( e.g ., p* € (1,1.01,1.02,1.03) to guarantee that 
this will essentially be the case as long as FFT (i.e. p* = 1) is stable. But more 
generally, the problem of estimating extrema from p*-norms (or, equivalently, 
from the p*-th roots of the p*-th moments of a distribution with bounded sup¬ 
port), will undoubtedly permit many more possible approaches that we have not 
yet considered. One that would be compelling is to relate the Fourier transform 
of the sequential moments to the maximum value in the distribution; such an 
approach could permit all stable p* at any index rn to be used to efficiently 
approximate the maximum value (by computing the FFT of the sequence of 
norms). Such new adaptations of the method could permit low worst-case error 
without any noticable runtime increase. 

4.1 Multidimensional Numerical Max-Convolution 

The fast numerical piecewise method for max-convolution (and the affine piece- 
wise modification) are both applicable to matrices as well as vectors (and, most 
generally, to tensors of any dimension). This is because the p*-norm (as well 
as the derived error bounds as an approximation of the Chebyshev norm) can 
likewise approximate the maximum element in the tensor ) generated 

to find the max-convolution result at index ?ni,TO2 ,... of a multidimensional 
problem, because the sum 

E (“<”C ! '■’)'” 

computed by convolution corresponds to the Frobenius norm (i.e. the “entry- 
wise norm”) of the tensor, and after taking the result of the sum to the power 
—;r, will converge to the maximum value in the tensor (if p* is large enough). 
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This means that the fast numerical approximation, including the affine piece- 
wise modification, can be used without modification by invoking standard multi¬ 
dimensional convolution (i.e., *). Matrix (and, in general, tensor) convolution is 
likewise possible for any dimension via the row-column algorithm, which trans¬ 
forms the FFT of a matrix into sequential FFTs on each row and column. The 
accompanying Python code demonstrates the fast numerical max-convolution 
method on matrices, and the code can be run on tensors of any dimension 
(without requiring any modification). 

The speedup of FFT tensor convolution (relative to naive convolution) be¬ 
comes considerably higher as the dimension of the tensors increases; for this 
reason, the speedup of fast numerical max-convolution becomes even more pro¬ 
nounced as the dimension increases. For a tensor of dimension d and width k 
(i.e., where the index bounds of every dimension are € {0,1 ,... fc— 1}), the cost 
of naive max-convolution will be in 0(k 2d ), whereas the cost of numerical max- 
convolution is 0(k d logo(fc)) (ignoring the log 2 (log i (k)) < 18 multiplier), 

meaning that there is an 0( d Io g ^ ) speedup from the numerical approach. 
Examples of such tensor problems include graph theory, where adjacency ma¬ 
trix representation can be used to describe respective distances between nodes 
in a network. 

As a concrete example, the demonstration Python code computes the max- 
convolution between two 256 x 256 matrices. The naive method required 494 sec¬ 
onds, but the numerical result with the original piecewise method was computed 
in 3.18 seconds (yielding a maximum absolute error of 0.0173 and a maximum 
relative error of 0.0511) and the numerical result with the null space projection 
method was computed in 3.99 seconds (using p* nax = 512, which corresponds 
to a relative error of < 0.1% in the top contour, yielding a maximum absolute 
error of 0.0141 and a maximum relative error of 0.0227) and in 3.05 seconds 
(using Pmax = 64, which corresponds to a relative error of < 2.5% in the top 
contour, yielding a maximum absolute error of 0.0667 and a maximum relative 
error of 0.067). Not only does the speedup of the proposed methods relative 
to naive max-convolution increase significantly as the dimension of the tensor 
is increased, no other faster-tlran-naive algorithms exist for max-convolution of 
matrices or tensors. 

Multidimensional max-convolution can likewise be applied to hidden Markov 
models with additive transitions over multidimensional variables ( e.g ., allowing 
the latent variable to be a two-dimensional joint distribution of American and 
German unemployment with a two-dimensional joint transition probability). 

4.2 Max-Deconvolution 

The same p*-norm approximation can also be applied to the problem of nrax- 
deconvolution (i.e., solving M = L* max R for R when given M and L). This can 
be accomplished by computing the ratio of FFT(M P ) to FFT(L P ) (assuming 
L has already been properly zero-padded), and then computing the inverse FFT 
of the result to approximate R p ; however, it should be noted that deconvolution 
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methods are typically less stable than the corresponding convolution methods, 
computing a ratio is less stable than computing a product (particularly when 
the denominator is close to zero). 

4.3 Amortized Argument for Low MSE of the Affine 
Piecewise Method 

Although the largest absolute error of the affine piecewise method is the same 
as the largest absolute error of the original piecewise method, the mean squared 
error (MSE) of the affine piecewise method will be lower than the square of the 
worst-case absolute error. 

To achieve the worst-case absolute error for a given contour the affine cor¬ 
rection must be negligible; therefore, there must be two nearly vertical points 
on the scatter plot of ||u( mi )|| 00 vs. ||?A mi ) || p *, which are both extremes of the 
bounding envelope from Figure |3l Thus, there must exist two different indices 
mi and m2 with vectors where ||-uA mi )|| p . ~ ||oo and where 

ll« (m2) llr* ~ \\u {m2) \\ooki 2 

(creating two vertical points on the scatter plot, and forcing that both cannot 
simultaneously be corrected by a single affine mapping). In order to do this, it 
is required to have u^ mi ^ filled with a single nonzero value and for the remaining 
elements of u^ mi ^ to equal zero. Conversely, u,( m2 ' > must be filled entirely with 
large, nonzero values (the largest values possible that would still use the same 
contour p*). Together, these two arguments place strong constraints on the vec¬ 
tors L' and R' (and transitively, also constrains the unsealed vectors L and R): 
On one hand, filling with k mi — 1 zeros requires that k mi — 1 elements from 
either L or R must be zero (because at least one factor must be zero to achieve 
a product of zero). On the other hand, filling with all large-value nonze¬ 

ros requires that k m2 elements of both L and R are nonzero. Together, these 
requirements stipulate that both k mi — 1 + k m2 < k , because entries of L and R 
cannot simultaneously be zero and nonzero. Therefore, in order to have many 
such vertical points, constrains the lengths of the u^ mi \u < ' rri2 \ ... vectors 

corresponding to those points. While the worst-case absolute error bound pre¬ 
sumes that an individual vector u may have length k, this will not be possible 
for many vectors corresponding to vertical points on the scatter plot. For this 
reason, the MSE will be significantly lower than the square of the worst-case ab¬ 
solute error, because making a high affine-corrected absolute error on one index 
necessitates that the absolute errors at another index cannot be the worst-case 
absolute error (if the sizes of L and R are fixed). 

5 Availability 

Code for exact max-convolution and the fast numerical method (which 
includes both || • || p * and null space projection methods) is imple¬ 
mented in Python and available at https://bitbucket.org/orserang/ 
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fast-numerical-max-convolution All included code works for numpy arrays 
of any dimension, i.e. tensors). 
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